ABSTRACT



Given some consensus that statistical significance tests are 
broken, misused, or at least have somewhat limited utility, the focus of 
discussion within the field ought to move beyond additional bashing of 
statistical significance tests, and toward more constructive suggestions for 
improved practice. Five suggestions for improved practice are recommended:
(1) required reporting of effect sizes; (2) reporting of effect sizes in an
interpretable manner; (3) explicating the values that bear on results; (4)
providing evidence of result replicability; and (5) reporting confidence
intervals. Although the five recommendations can be followed even if
statistical significance tests are reported, social science will proceed most
rapidly when research becomes the search for replicable effects noteworthy in
magnitude in the context of both the inquiry and personal or social values.
(Contains 1 table and 74 references.) (Author/SLD)

Statistical Significance -2-
Abstract 

Given some consensus that statistical significance tests are 
broken, misused, or at least have somewhat limited utility, the
focus of discussion within the field ought to move beyond 
additional bashing of statistical significance tests, and toward 
more constructive suggestions for improved practice. Five 
suggestions for improved practice are recommended; these involve 
(a) required reporting of effect sizes, (b) reporting of effect 
sizes in an interpretable manner, (c) explicating the values that 
bear upon results, (d) providing evidence of result replicability, 
and (e) reporting confidence intervals. Though the five 
recommendations can be followed even if statistical significance 
tests are reported, social science will proceed most rapidly when 
research becomes the search for replicable effects noteworthy in 
magnitude in the context of both the inquiry and personal or social
values.



A few years ago Pedhazur and Schmelkin (1991) asserted that 
"probably very few methodological issues have generated as much 
controversy" (p. 198) as have the use and interpretation of
statistical significance tests. These tests have certainly proven
surprisingly resistant to repeated efforts "to exorcise the null
hypothesis" (Cronbach, 1975, p. 124). Particularly noteworthy
among the historical efforts to accomplish the exorcism have been
works by Rozeboom (1960), Morrison and Henkel (1970), Carver
(1978), Meehl (1978), Shaver (1985), and Oakes (1986). The entire
Volume 61, Number 4 issue of the Journal of Experimental Education 
was devoted to these themes. Yet, notwithstanding the long-term 
availability of these publications, even today some psychologists 
still do not understand what statistical significance tests do and 
do not do. 

In a public-domain brief digest disseminated as a class 
handout by the U.S. Department of Education Educational Resources 
Information Center, Thompson (1994a) provided some simple tests of 
understanding of what p calculated actually evaluates:

In which one of each of the following [three] pairs
of studies will the p calculated be smaller?

-- In two studies each involving three groups of
subjects each of size 30, in one study the
means were 100, 100, and 90, and in the second
study the means were 100, 100, and 100.

-- In two studies each comparing the standard
deviations (SD) of scores on the dependent
variable of two groups of subjects, in both
studies SD1 = 4 and SD2 = 3, but in study one
the sample sizes were 100 and 100, while in
study two the sample sizes were 50 and 50.

-- In two studies involving a multiple regression
prediction of Y using predictors X1, X2, and X3,
and both with sample sizes of 75, in study one
R² = .49 and in study two R² = .25. (p. 5)

These judgments do not require calculations or additional 
information. However, making such judgments does require a common- 
sense understanding of what statistical significance tests are all 
about. 1 
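
The third pair can be worked through as a rough sketch (it is not part of the quoted digest). Assuming the conventional nil null hypothesis that R² = 0, the F statistic depends only on R², the sample size n, and the number of predictors k; with n and k held constant, the larger F corresponds to the smaller p calculated. The function name is hypothetical and chosen only for illustration.

```python
def f_for_r_squared(r2, n, k):
    """F statistic for the nil null H0: R^2 = 0, given k predictors and n cases."""
    df_numerator = k
    df_denominator = n - k - 1
    return (r2 / df_numerator) / ((1.0 - r2) / df_denominator)


# Both studies use n = 75 and three predictors; only R^2 differs.
f_study_one = f_for_r_squared(0.49, n=75, k=3)
f_study_two = f_for_r_squared(0.25, n=75, k=3)

# With identical degrees of freedom, the larger F has the smaller p.
print(round(f_study_one, 2), round(f_study_two, 2))
```

No table of the F distribution is needed for the comparison, which is the point of the item: the judgment follows from knowing what the test statistic is built from.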

It is not clear how well most authors would do on the previous 
three-item evaluation. Many of us continue to prefer "investing... 
[these tests] with what appear to be magical powers" (Pedhazur & 
Schmelkin, 1991, p. 198). And some of us try to use p values to 
cling to a mantle of unattainable objectivity. 

The use of statistical tests has recently stimulated yet more 
controversy. Contemporary commentaries include those provided by 
Hunter (1997), Kirk (1996), Schmidt (1996), and Thompson (1996, 
1997). The less positive treatments of statistical significance
tests have also provoked reactions from test advocates (cf. Chow,
1988; Frick, 1996; Hagen, 1997; Greenwald, Gonzalez, Harris &
Guthrie, 1996; Robinson & Levin, 1997). Yet even Frick (1996)
acknowledged that critics of conventional practices "usefully point
out the limitations of null hypothesis testing" (p. 388).

1 For each of the three pairs of studies, the first study within each pair
has a smaller p calculated value, if conventional nil null hypotheses (i.e.,
H0: M1 = M2 = M3; H0: SD1 = SD2; and H0: R² = 0) are used.

Given growing consciousness regarding these limitations, the 
APA Board of Scientific Affairs recently named a Task Force on 
Statistical Inference (Shea, 1996). The APA Task Force is charged
with recommending policies and practices leading to more informed 
and thoughtful statistical analyses, including those involving the 
use of statistical significance tests. 

Articles within the American Psychologist, published on a
seemingly periodic basis, have especially informed the movement of
the field as regards statistical significance testing. Table 1
lists some of these articles, and also reports citation frequencies
for the articles as of 1996. These American Psychologist articles,
and the related comments published within the journal, have
considerably influenced psychology and the social sciences more
generally. For example, Roger Kirk (1996) characterized the two
American Psychologist articles by Cohen as "classics," and argued
that "the one individual most responsible for bringing the
shortcomings of hypothesis testing to the attention of behavioral
and educational researchers is Jacob Cohen" (p. 747).

INSERT TABLE 1 ABOUT HERE 

The present paper briefly reviews some of the consensus that 
has arisen or seems to be occurring as regards the use and limits 
of statistical significance tests. However, the present treatment 
also explores both (a) recommendations involving changes in 
research practices and editorial policies and (b) related issues 
that the field has yet to resolve. Given some consensus that
statistical significance tests are broken, misused, or at least have
somewhat limited utility, the focus of discussion within the field
ought to move beyond additional bashing of statistical significance
tests, and toward more constructive suggestions for improved
practice.

Emerging Consensus 

The field appears to have achieved or is approaching consensus 
regarding certain limitations of statistical significance tests. At 
least three noteworthy realizations can be briefly cited. 

Result Effect Size 

First, researchers have recognized that p values are not
useful as indices of study effect sizes. The calculated p values
in a given study are a function of several study features, but are
particularly influenced by the confounded, joint influence of study
sample size and study effect sizes. Because p values are confounded
indices, in theory 100 studies with varying sample sizes and 100
different effect sizes could each have the same single p calculated,
and 100 studies with the same single effect size could each have
100 different values for p calculated.
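
The confounding can be made concrete with a minimal sketch. It assumes the simplest possible case, a two-group z test with equal group sizes and a known population standard deviation, so that z = d * sqrt(n / 2) for a standardized mean difference d. The function name and the particular values of d and n are illustrative assumptions, not taken from any study cited here.

```python
import math


def two_group_z_p(d, n_per_group):
    """Two-tailed p for a standardized mean difference d between two
    equal-sized groups, treating the population SD as known (z test)."""
    z = d * math.sqrt(n_per_group / 2.0)
    return math.erfc(abs(z) / math.sqrt(2.0))


# One effect size, three sample sizes: three different p values.
same_effect = [two_group_z_p(0.2, n) for n in (20, 200, 2000)]

# Four effect sizes, with n chosen as 2 * (z / d)^2 so that z stays at 2.0:
# four different effects, one shared p value.
pairs = [(d, 2.0 * (2.0 / d) ** 2) for d in (0.1, 0.2, 0.5, 1.0)]
same_p = [two_group_z_p(d, n) for d, n in pairs]

print(same_effect)
print(same_p)
```

In the second list a tenfold range of effect sizes is hidden behind a single p value, which is why the p value alone cannot serve as an index of effect.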

This realization led to an important change in the fourth 
edition of the American Psychological Association style manual 
(APA, 1994). The manual noted that

Neither of the two types of probability values
[statistical significance tests] reflects the
importance or magnitude of an effect because both
depend on sample size. . . You are [therefore]
encouraged to provide effect-size information. (APA,
1994, p. 18, emphasis added)

Result Importance 

Second, more and more researchers and editors have come to 
recognize that p values do not evaluate result importance. 
Therefore, p values cannot be used as an effective vehicle for 
escaping disagreement and confrontation regarding our subjective 
judgments of the worth of our results. As Thompson (1993) noted,

importance is a question of human values, and math
cannot be employed as an atavistic escape (a la
Fromm's Escape from Freedom) from the existential
human responsibility for making value judgments. If
the computer package did not ask you your values
prior to its analysis, it could not have considered
your value system in calculating p's, and so p's
cannot be blithely used to infer the value of
research results. (p. 365)

Result Replicability 

Third, researchers have recognized that p calculated values 
are not informative regarding either probable population values or 
the likelihood of result replication in future samples (Thompson,
1996). As Cohen (1994) made so clear, these calculations presume
that the null hypothesis exactly describes the population, and then
indicate the probability of the sample results (or of sample
results even more disparate from the null than those in the actual
sample), given the sample size.

But what we want to know is the population parameters, given 
the statistics in the sample and the sample size. This interest in
true population values stems from a desire to avoid the discovery
of cold fusion, which leads to a single jubilant conference
experience, followed by a lifetime of being shunned at all
remaining professional meetings. If we could infer the population
parameters, given the sample statistics and sample size, then we 
might have some confidence that future research would yield sample 
statistics similar to those in our own sample. 

Unfortunately, the direction of the inference in inferential 
statistics is from the population and to the sample, and not from
the sample to the population (Thompson, 1997). Thus Cohen (1994)
concluded that the statistical significance test "does not tell us
what we want to know, and we so much want to know what we want to
know that, out of desperation, we nevertheless believe that it
does!" (p. 997).

Recommended Changes in Practice

A few scholars have called for the banning of statistical
significance tests (cf. Carver, 1978, 1993). However, the fact
that many psychologists misinterpret statistical significance tests
is not a reasonable warrant for banning these tests. As Strike
(1979) explained, "To deduce a proposition with an 'ought' in it
from premises containing only 'is' assertions is to get something
in the conclusion not contained in the premises, something
impossible in a valid deductive argument" (p. 13). In logic this
fallacy is called a "should/would" or "is/ought" error (Hudson,
1969).

But more and more researchers also now realize that "virtually
any study can be made to show [statistically] significant results
if one uses enough subjects" (Hays, 1981, p. 293). This means that

Statistical significance testing can involve a
tautological logic in which tired researchers, 
having collected data from hundreds of subjects, 
then conduct a statistical test to evaluate whether 
there were a lot of subjects, which the researchers 
already know, because they collected the data and 
know they're tired. (Thompson, 1992b, p. 436) 

Consequently, attention has now turned toward ways to improve 
practice. Five potential improvements in practice are suggested 
here. 

Effect Size Reporting 

Empirical studies of articles published since 1994 in 
psychology, counseling, special education, and general education 
suggest that merely "encouraging" effect size reporting (APA, 1994) 
has not appreciably affected actual reporting practices (e.g.,
Kirk, 1996; Snyder & Thompson, 1997; Thompson & Snyder, in press-a, 
in press-b; Vacha-Haase & Nilson, in press). Apparently, when it 
comes to reporting and interpreting effect sizes, many are called 
but few choose to be chosen. Consequently, editorial policies at 
some journals now require authors to report and interpret effect 
sizes (Heldref Foundation, 1997; Thompson, 1994b; see also Loftus, 
1993, and Shrout, 1997). 

Effect sizes are important to report and interpret for at 
least two reasons. First, these indices can help inform judgment 
regarding the practical or substantive significance of results. 
Statistical significance tests do not bear upon the noteworthiness
of results, because improbable events are not necessarily important
(see Shaver's (1985) classic example), and because "if the null 
hypothesis is not rejected, it is usually [only] because the N is 
too small" (Nunnally, 1960, p. 643). 

Second, reporting effect sizes facilitates the meta-analytic 
integration of findings across a given literature. People who 
incorrectly believe, either consciously or unconsciously, that 
statistical significance tests evaluate the probability of 
population parameters can exaggerate the importance of a single 
study, because the study then generalizes to the population. 
Persons who recognize the limits of these statistical tests realize 
that most single studies are important primarily only as building 
blocks within a cumulative body of evidence. As Schmidt (1996) 
noted: 

Meta-analysis... has revealed how little information 
there typically is in any single study. It has shown 
that, contrary to widespread belief, a single 
primary study can rarely resolve an issue or answer 
a question. (p. 127)

Reporting effect sizes helps meta-analysts more easily and more 
accurately synthesize findings, because the analyst can then avoid 
using more approximate effects computed based on sometimes tenuous 
statistical assumptions. 

Of course, effect size is no more a panacea than is a 
statistical significance test, for two reasons noted by Zwick
(1997). First, because human values are also not part of the
calculation of an effect size, any more than values are part of the
calculation of p, "largeness of effect does not guarantee practical
importance any more than statistical significance does" (p. 4).

Second, some researchers seem to have adopted Cohen's (1988) 
definitions of small, medium and large effects with the same 
rigidity that "α = .05" has been adopted. Such rigidity is
inappropriate. Cohen (1988) only intended these as impressionistic
characterizations of result typicality across a diverse literature,
and not as rigid universal criteria. However, some empirical
studies suggest that the characterization is reasonably accurate
(Glass, 1979; Olejnik, 1984) at least as regards a literature
historically built with a bias against statistically non-
significant results (Rosenthal, 1979).

Notwithstanding these caveats, it is suggested that all
authors of quantitative studies should report and interpret effect
sizes. Because merely encouraging these practices has to date had
little or no effect, at some point it may become necessary to
require that effect sizes be reported. Of course, a requirement
that effect sizes be reported does not inherently require that a
whole new system of statistical analyses be invoked; all our
classical analytic methods can be used to yield both p calculated and
effect size values, even though the methods have traditionally been
used only for the first purpose.

Effect Size Interpretability

There are myriad effect sizes from which the researcher can
choose. Useful reviews of the choices have been provided by Kirk
(1996), Snyder and Lawson (1993), and Friedman (1968), among
others.

Effect sizes can be categorized into two broad classes:
variance-accounted-for measures (e.g., R², eta²) and standardized
differences (e.g., Cohen's d, Hedges' g) [Kirk (1996) identifies a
third, "miscellaneous" class]. Variance-accounted-for indices can
be computed in all classical statistical analyses because all
analyses are correlational, even though some designs are
experimental and some are not (Knapp, 1978; Thompson, in press).

Furthermore, effect sizes can also be subdivided as being
either "uncorrected" (e.g., R², eta²) or "corrected" (e.g.,
adjusted R², omega²). Because all conventional analyses are least-
squares correlational methods that capitalize on all sample
variance, including the sampling error variance unique to the 
sample, all uncorrected variance-accounted-for statistics are 
positively biased and overestimate population effects. This bias 
can be statistically removed via the corrected effect size formulas 
which estimate the influence of the three major factors 
contributing to sampling error: 

1. Samples with smaller sample sizes tend to have more 
sampling error; 

2. Studies with more variables tend to have more sampling
error; and

3. Samples from populations with larger variance-accounted-
for parameters tend to have less sampling error.

Regarding this last influence, the case can be made clear at the
extreme for a study involving the statistic r². If the population
parameter is 1.0, it is impossible to draw a sample that yields an
inaccurate effect size, even if the sample is only two or three
pairs of data on the two variables.
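
The three influences can be seen in a short sketch. No correction formula is given in the text above, so the sketch assumes the widely used adjusted R² (often attributed to Ezekiel), 1 - (1 - R²)(n - 1)/(n - k - 1), as one example of a "corrected" index; other corrections exist and behave similarly. The function name and the input values are hypothetical.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 for n cases and k predictors (assumed Ezekiel-type formula)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


def shrinkage(r2, n, k):
    """How much the uncorrected R^2 exceeds its corrected estimate."""
    return r2 - adjusted_r_squared(r2, n, k)


baseline = shrinkage(0.30, n=20, k=3)        # small sample, few predictors
larger_sample = shrinkage(0.30, n=200, k=3)  # influence 1: larger n
more_variables = shrinkage(0.30, n=20, k=6)  # influence 2: more predictors
larger_effect = shrinkage(0.90, n=20, k=3)   # influence 3: larger effect

print(baseline, larger_sample, more_variables, larger_effect)
```

Relative to the baseline, the estimated positive bias falls with a larger sample, rises with more predictors, and falls when the effect itself is larger.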

The field has not yet established a single preferred effect 
size, a preference for variance-accounted-for as against 
standardized differences indices, or a preference for corrected as 
against uncorrected indices. It is doubtful that the field will 
ever settle on a single index to be used in all studies, given that 
so many choices exist and because the statistics can usually be 
translated into approximations across the two major classes. 
However, some pluses and minuses for both variance-accounted-for 
and standardized differences indices can be noted. 

On the one hand, variance-accounted-for indices do have the 
benefit of reinforcing the realization that all classical analyses 
are correlational (Knapp, 1978; Thompson, in press). This may 
minimize the autonomic choice of ANOVA as an analytic method based 
on an unconscious association of ANOVA with the ability to make 
causal inferences (cf. Humphreys & Fleishman, 1974).

On the other hand, standardized difference effect sizes (e.g., 
the difference of the experimental group mean minus the control 
group mean divided by the control group standard deviation) may be 
more directly interpretable. For example, Saunders, Howard and 
Newman (1988) argued that a variance-accounted-for effect is "still 
cast in a language that was foreign to (and unusable by) 
practitioners" (pp. 207-208); a variance-accounted-for 2% effect
usually must be expressed in the metric of an outcome variable to 
be meaningful. 
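
A minimal sketch of the standardized difference described above (the mean difference divided by the control group standard deviation, commonly called Glass's delta) follows. The scores are invented purely for illustration, the function name is hypothetical, and the sample (n - 1) standard deviation is an assumption.

```python
import statistics


def glass_delta(experimental, control):
    """(experimental mean - control mean) / control-group SD."""
    difference = statistics.mean(experimental) - statistics.mean(control)
    return difference / statistics.stdev(control)  # sample SD, n - 1 divisor


# Hypothetical outcome scores, not data from any cited study.
control = [10, 12, 14, 16, 18]
experimental = [13, 15, 17, 19, 21]

raw_difference = statistics.mean(experimental) - statistics.mean(control)
delta = glass_delta(experimental, control)

print(raw_difference, round(delta, 2))
```

The raw difference is in the metric of the outcome variable, while delta restates the same difference in control-group standard deviation units; reporting both speaks to practitioners and to meta-analysts alike.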

However, not all studies involve experiments or a focus on 
means, and the use of standardized differences can seem stilted in
such contexts. Thus, there are no clear-cut choices of an optimal
effect size, or even a class of effect indices. 

But it does seem reasonable to expect at a minimum that effect 
sizes should always be presented in an accessible metric (e.g., 
years added to longevity, on the average, from not smoking; median 
number of additional months due to an intervention that Alzheimer's 
patients were able to live without institutionalization). Several
clinical disciplines have explored innovative ways to meet these
requirements (see, for example, the half-dozen articles in a 1988
special issue of Behavioral Assessment, including the report by
Saunders, Howard and Newman (1988)). But continued development of
more effective ways to communicate effects remains warranted.

Values Explication

Cohen's (1988) typicality characterizations are not suitable 
as rigid criteria for noteworthiness, nor were they meant to be so 
used. The only suitable criteria for evaluating result value (a) 
must be informed by the personal, idiosyncratic values of each 
researcher and (b) must take into account the particular context of 
a given study. Regarding the first point, Huberty and Morris 
(1988, p. 573) noted that, "As in all of statistical inference, 
subjective judgment cannot be avoided. Neither can reasonableness!" 

Regarding the context of a given study, a 2% variance-
accounted-for effect size will not be noteworthy to most
researchers (or to most readers) in the context of a study like one
I once read, titled "Smiling and Touching Behavior of Adolescents
in Fast Food Restaurants." However, Gage (1978) pointed out that
the relationship between cigarette smoking and lung cancer involves
roughly this same effect size, and noted that:

Sometimes even very weak relationships can be 
important... [O]n the basis of such correlations,
important public health policy has been made and
millions of people have changed strong habits. (p. 21)

Certainly a small variance-accounted-for effect size involving
highly valued outcomes, such as longevity, can be noteworthy. But 
since the judgments of result noteworthiness are inherently value 
driven, and are "on the average," even here some may reach a 
seemingly reasoned decision that the effect is not noteworthy, or 
at least not noteworthy enough to merit changed behavior. 

Many scientists will probably feel uncomfortable declaring 
their effects in a meaningful metric and then explicating the 
associated personal or societal values that make these effects 
noteworthy. Declarations that "my results were [statistically] 
significant" will have to be replaced with, "This intervention 
extends life expectancy, on the average, by 1.4 years, and given my 
valuing of life, I believe this result is noteworthy." 

Historically, social scientists have used p statistics as a 
way to finesse values differences, because conflicting values of 
different people are not readily reconcilable. Nevertheless, 
researchers should be expected to declare the values that make 
their effects noteworthy. 

Normative practices for evaluating such assertions will have 
to evolve. Research results should not be published merely because 
the individual researcher thinks the results are noteworthy. By the
same token, editors should not quash research reports merely
because they find explicated values unappealing. These resolutions 
will have to be formulated in a spirit of reasoned comity. 

But we also must realize that our historical reliance on p 
values as a way to avoid value assertions led only to feigned 
objectivity, and not to real objectivity. This feigned objectivity 
was built on the edifice of misinterpretation of what statistical 
significance tests really do. 

Evidence of Replicability 

The cumulation of knowledge about relationships that recur 
under specified conditions is the sine qua non of science for those 
psychologists who believe that such laws can reasonably be 
formulated. For these psychologists evidence of result 
replicability is critical for creating a warrant that results are 
noteworthy.

The required nature of this warrant has received too little 
attention in an era when statistical significance tests were 
thought to evaluate result replicability, when these tests were 
thought to evaluate (rather than merely to presume) selected 
population parameters. Several vehicles for establishing these 
warrants can be noted. 

One warrant involves an important contribution that Jacob 
Cohen made in his 1994 article; this very important contribution 
has not been as widely noticed as might be hoped (Hagen, 1997).
Cohen (1994) carefully distinguished the general class of "null"
hypothesis tests from a subclass of null tests he labelled the
"nil" hypothesis test. [A related important distinction is what
Meehl (1997) has described as "strong" versus "weak" null
hypothesis refutation.]

For Cohen, a nil null hypothesis always specifies zero
difference or zero relationship (e.g., for the especially
inappropriate test of a reliability statistic, H0: r_xx = 0; HA: r_xx
≠ 0), while other non-nil null hypotheses may test an alternative
hypothesis (e.g., HA: r_xx > .7). Cohen's important distinction
recognizes that a "null hypothesis means the hypothesis to be
nullified, not necessarily a hypothesis of no difference" (Chow,
1988, p. 105).

Some specific null must be presumed true in the population, or 
otherwise infinitely many parameters are possible and the p calculated
for the sample results becomes indeterminate (Thompson, 1996). Most
researchers use a nil hypothesis as the null partly because this is 
what most computer packages assume, and partly because methodology 
for invoking non-nil null hypotheses has some "complexity, and it 
is not yet readily applicable in many designs" (Dar, Serlin, & 
Omer, 1994, p. 81). 
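
For the simple case of a single correlation, the contrast between a nil and a non-nil null can be sketched as follows. The sketch assumes the familiar Fisher r-to-z approximation, which is not discussed in the sources cited above, and a hypothetical result of r = .80 with n = 100; the function name is likewise an assumption. It illustrates only this one design and does not address the more complex designs the quotation refers to.

```python
import math


def fisher_z_test(r, n, rho0):
    """One-tailed test of H0: rho = rho0 against HA: rho > rho0,
    using the Fisher r-to-z normal approximation."""
    z = (math.atanh(r) - math.atanh(rho0)) * math.sqrt(n - 3)
    p = 0.5 * math.erfc(z / math.sqrt(2.0))  # upper-tail normal probability
    return z, p


# Hypothetical sample result: r = .80 with n = 100.
z_nil, p_nil = fisher_z_test(0.80, 100, rho0=0.0)          # nil null
z_non_nil, p_non_nil = fisher_z_test(0.80, 100, rho0=0.7)  # non-nil null

print(round(z_nil, 2), p_nil)
print(round(z_non_nil, 2), round(p_non_nil, 4))
```

Against the nil null the result looks overwhelming; against a null of .7, drawn from what prior literature might lead one to expect, the same result is far less remarkable.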

The mindless use of the nil hypothesis obviates the necessity 
prospectively to extrapolate thoughtful expected effect sizes from 
prior literature as part of study design. Furthermore, the 
interpretation of "[statistical] significance" as indicating result 
value means that some researchers do not retrospectively interpret 
their study effects in the context of specific previous findings. 
These failures are most unfortunate, because the prospective and 
retrospective use of effects from prior studies is itself a check 
on the replicability of results in a given inquiry.

Empirical evidence for result replicability can be either
"external" or "internal" (Thompson, 1993, 1996). "External"
replication studies invoke a new sample measured at a different
time and/or a different location. Such replications have 
unfortunately been undervalued (Robinson & Levin, 1997), perhaps 
because some researchers thought they were already testing 
replicability by conducting statistical significance tests. 

"Internal" replicability analyses use the sample in hand to 
combine the participants in different ways to try to estimate how 
much the idiosyncrasies of individuality within the sample have
compromised sample results. The major "internal" replicability 
analyses are cross-validation, the jackknife, and the bootstrap 
(Diaconis & Efron, 1983); the logics are reviewed in more detail 
elsewhere (cf. Thompson, 1993, 1994c). 

"Internal" evidence for replicability is never as good as an 
actual replication (Robinson & Levin, 1997; Thompson, 1997), but is 
certainly better than presuming that a statistical significance 
test assures result replicability. And such "internal" 
replicability evidence is useful for researchers who for practical 
reasons cannot externally replicate all results prior to graduation 
or tenure review. 

It is important that these logics when used to evaluate result 
replicability are not confused with other uses of the same logics 
(Thompson, 1993). For example, the inferential use of the bootstrap 
involves using the bootstrap to estimate a sampling distribution 
when the sampling distribution is not known or assumptions for the 
use of a known sampling distribution cannot be met. The descriptive
use of the bootstrap looks primarily at the variance in parameter
estimates across many different combinations of the participants. 

The inferential application requires considerably more
"re-samples" (see Thompson, 1994c) than the descriptive application
recommended here. This is because the inferential focus is on the
tails of the estimated sampling distribution (e.g., the 95th
percentile of the distribution, for a one-tailed statistical
significance test), rather than the descriptive focus on the
standard deviation (i.e., the "standard error") of the sampling
distribution. Estimates in the tails of the sampling
distribution are rarer, and therefore many more bootstrap
re-samples are required to estimate these very small or large
percentiles.
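
The descriptive use of the bootstrap can be sketched in a few lines. The sketch assumes a bivariate correlation as the effect of interest, simulated data, and 500 re-samples; all names and values are illustrative assumptions and are not drawn from the software cited in this paper.

```python
import math
import random
import statistics


def pearson_r(pairs):
    """Pearson correlation for a list of (x, y) pairs."""
    xs, ys = zip(*pairs)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def bootstrap_se(pairs, n_resamples=500, seed=0):
    """SD of r across re-samples drawn with replacement from the sample."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        resample = [rng.choice(pairs) for _ in pairs]  # same size as the sample
        estimates.append(pearson_r(resample))
    return statistics.stdev(estimates)


# Simulated sample of 40 participants (hypothetical data).
data_rng = random.Random(1)
xs = [data_rng.gauss(0, 1) for _ in range(40)]
pairs = [(x, 0.6 * x + data_rng.gauss(0, 0.8)) for x in xs]

r = pearson_r(pairs)
se = bootstrap_se(pairs)
print(round(r, 2), round(se, 2))
```

A small standard deviation across re-samples suggests that the estimate does not hinge on which participants happened to be included; a large one is a warning that it does.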

The field has not yet resolved all the issues involved in 
establishing a sufficient warrant for result replicability, again, 
perhaps because some authors incorrectly assumed that statistical 
tests evaluated the population. The relevant software to conduct 
"internal" bootstrap analyses is already available (e.g., Lunneborg 
(1987) for univariate applications, and Thompson (1992a, 1995) for 
multivariate applications). Because replicability evidence is
critical to the cumulation of knowledge, more authors should be
expected to provide some evidence of result replicability.

Reporting Confidence Intervals

Various scholars have recommended that confidence intervals 
should be used to replace or supplement statistical significance 
tests (e.g., Dar, Serlin, & Omer, 1994; Meehl, 1997; Schmidt, 1996; 
Serlin, 1993). However, researchers using confidence intervals must
remember that "the interval endpoints are themselves random
variables" (Zwick, 1997, p. 5) also estimated using sample data.
Furthermore, researchers who mindlessly interpret confidence
intervals only against the standard of whether the interval
subsumes zero are doing nothing more than a mindless "nil"
hypothesis test (Cortina & Dunlap, 1997).

However, confidence intervals do have one very appealing 
feature, as Schmidt (1996) made clear. Even if all the research in 
an area of inquiry was based on radically erroneous estimates of 
parameters (and even if these a priori estimates were used in 
specifying non-nil null hypotheses), the parameter would still
emerge across studies as a series of overlapping confidence 
intervals converging on the same parameter. 
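
As a rough sketch of interpreting intervals in relation to each other, the following assumes three hypothetical studies of the same correlation and approximate 95% intervals built with the Fisher r-to-z transformation. The study values and the function name are invented for illustration.

```python
import math


def r_confidence_interval(r, n, z_crit=1.96):
    """Approximate 95% CI for a correlation via the Fisher r-to-z transform."""
    center = math.atanh(r)
    half_width = z_crit / math.sqrt(n - 3)
    return math.tanh(center - half_width), math.tanh(center + half_width)


# Hypothetical (r, n) pairs from three studies of the same relationship.
studies = [(0.25, 40), (0.35, 120), (0.30, 400)]
intervals = [r_confidence_interval(r, n) for r, n in studies]

# Region shared by all three intervals, if any.
overlap_low = max(low for low, high in intervals)
overlap_high = min(high for low, high in intervals)

for (r, n), (low, high) in zip(studies, intervals):
    print(n, round(low, 2), round(high, 2))
print(round(overlap_low, 2), round(overlap_high, 2))
```

Judged only against zero, the smallest study "fails" because its interval subsumes zero. Read against the other two intervals, it is entirely consistent with them, and the shared region narrows as the larger studies are added.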

The use of confidence intervals might also militate against
the current bias in the literature (a) first favoring the 
publication of Type I errors and (b) then disfavoring publication 
of replication studies revealing the previously published Type I 
error. Setting alpha at a small level does not prevent any Type I 
errors; rather, the percentage of such errors is capped at a small 
proportion. But some such errors will unavoidably occur. Because 
the literature has been biased in favor of statistically 
significant results (Rosenthal, 1979), such Type I errors are
afforded priority for publication, but the replications with 
statistically non-significant results will compete at a 
disadvantage for journal space, and so the self-correction of 
science through replication will be impeded. Greenwald (1975) cited 
relevant actual examples. 
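
The point that alpha caps, rather than prevents, Type I errors can be checked with a small simulation. The sketch assumes many hypothetical studies in which the nil null is exactly true, each analyzed with a two-tailed one-sample z test with known standard deviation; the function name, the number of studies, and the sample size are arbitrary choices made for illustration.

```python
import math
import random


def type_one_error_rate(n_studies=4000, n=25, alpha=0.05, seed=0):
    """Share of true-null studies that nonetheless reach p < alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_studies):
        # Population mean is exactly 0, so the nil null is true by construction.
        sample_mean = sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n
        z = sample_mean * math.sqrt(n)          # population SD known to be 1
        p = math.erfc(abs(z) / math.sqrt(2.0))  # two-tailed p
        if p < alpha:
            rejections += 1
    return rejections / n_studies


rate = type_one_error_rate()
print(rate)
```

Roughly one in twenty of these studies is "statistically significant" even though no effect exists, and those are exactly the studies a literature biased toward significant results would favor for publication.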

A focus on consistency of findings across studies can be
achieved with confidence intervals interpreted in relation to each
other, rather than against the nil standard of a zero value.
Therefore, it is suggested that more authors should report
confidence intervals as part of their results.

Summary 

Kirk (1996) recently noted that, "Our science has paid a high 
price for its ritualistic adherence to null hypothesis significance 
testing" (p. 756). The overuse and misinterpretation of
statistical tests have been frequently decried as well in
literatures other than psychology, including medicine (Kraemer,
1992; Pocock, Hughes & Lee, 1987), business (Sawyer & Peter, 1983),
occupational therapy (Ottenbacher, 1984), and speech and hearing
(Young, 1993). Nevertheless, the use of statistical significance
tests remains common, and some empirical studies reflect even an
increased use of these methods (Parker, 1990).

Many have marveled at the robustness of the statistical 
significance logic against the application of the wooden stake 
through the heart. For example, Falk and Greenbaum (1995) noted:

We have shown the compelling nature and the
robustness of that illusion [that statistical
significance tests give us the information we need].
A massive educational effort is required to
eradicate the misconception and extinguish the
mindless use of a procedure that dies hard. (p. 94)

And Harris (1991) observed, "it is surprising that the dragon will
not stay dead" (p. 375).

Frick (1996) cited an anonymous reviewer of his defense of
statistical significance testing who argued that, "A way of
thinking that has survived decades of ferocious attacks is likely
to have some value" (p. 379). Of course, this view presumes a
completely rational model of science in which scientists are
objective, dispassionate logicians never acting merely out of
habit; the view also presumes that scientists are always anxious to
admit past errors publicly made in the articles they themselves
published over the courses of their careers.

Five specific suggestions for improved analytic practice were 
presented here. It should be noted that these suggestions can be 
followed even by those psychologists still employing conventional 
statistical significance tests. But social science will proceed 
most rapidly when research becomes the search for replicable 
effects noteworthy in magnitude in the context of both the inquiry 
and personal or social values.